Behavior Research Methods — Latest Matching Preprints

1

Initial Technical and Clinical Validation of Mobile Pupillometry with Virtual Reality: A Digital Biomarker for Screening Cognitive Function and Impairment

Brendler, A.; Fietz, J.; Bauer, A.; Pfahl, D.; Higgins, S.; Vidovic, E.; Brueckl, T.; BeCOME Working Group; Memory Clinic Working Group; Hupe, K.; Knop, M.; Spoormaker, V. I.

2026-07-17 neurology 10.64898/2026.07.15.26358187 medRxiv

Top 0.1%

10.8%

Cognitive impairment is a prevalent symptom extending from physiological ageing to disease. It commonly manifests initially as memory problems, and it progresses and co-occurs in more severe conditions such as Mild Cognitive Impairment, Alzheimer's Disease and Major Depressive Disorder. However, current screening assessments either lack biological information or are invasive and restricted to specialized centers with complex and cost-intensive set-ups. Here, we conducted an initial validation of mobile pupillometry with Virtual Reality (VR) under experimental conditions as a digital biomarker for cognitive impairment by testing required biomarker-specific properties. For this purpose, we first assessed its construct validity by testing healthy participants (n=43) on an n-back task in VR while pupil size was measured. Mixed-effects models revealed that, similar to lab-based eye-tracking systems, pupil size increased in a sensible and distinguishable fashion as a function of working memory load. Second, to test the signal's reliability, the same participants were tested on the identical set-up two to three months after their first visit. We observed that the pupil response profile was highly stable over this period. Third, for its clinical validity, we examined patients (n=89) from three different cohorts with varying degrees of cognitive impairment and compared them to healthy control participants (n=81). Mixed-effects models indicated that pupil size was reduced as a function of cognitive impairment levels at higher cognitive load and that this effect was more pronounced with increasing age. In conclusion, we provide initial evidence for mobile pupillometry being a sensitive, reliable and clinically valid digital biomarker for cognitive functioning and impairment, which offers desirable properties due to its quick, automated and location-independent set-up.
Keywords: digital biomarker, mobile pupillometry, Virtual Reality, cognition, Major Depressive Disorder, Mild Cognitive Impairment, Alzheimer's Disease

2

Evaluating Goodness of Pronunciation and Phonological Posteriors as Objective Markers of Speech Severity in Motor Speech Disorders

Wang, F.; Utianski, R. L.; Duffy, J. R.; Barnard, L. R.; Botha, H.

2026-07-16 neurology 10.64898/2026.07.14.26358076 medRxiv

Top 0.2%

2.7%

This study examined the extent to which goodness of pronunciation (GoP) scores and phonological posterior probabilities capture perceptual ratings of speech severity in individuals with motor speech disorders (MSD). Speech recordings of the word "catastrophe" were obtained from 489 participants, including 333 neurologically typical controls and 156 individuals with MSD. GoP scores were derived using traditional acoustic features and self-supervised speech representations, including WavLM and XLS-R, across multiple modeling approaches, while phonological posterior probabilities were extracted using Phonet. Model performance was evaluated using Kendall's rank correlations, regression, and receiver operating characteristic analyses against speech-language pathologists' perceptual ratings of sound distortion and intelligibility. Both GoP and phonological posterior probabilities were significantly associated with perceptual ratings. Self-supervised speech representations substantially outperformed traditional acoustic features, with WavLM-based GoP using k-nearest neighbors achieving the strongest performance. Across correlation, regression, and classification analyses, GoP consistently outperformed phonological posterior probabilities for both sound distortion and intelligibility. Age and gender had minimal influence on model-derived measures or their relationships with perceptual ratings. These findings demonstrate the value of self-supervised GoP as an objective measure of speech impairment while highlighting the complementary role of phonological posterior probabilities in characterizing articulatory aspects of motor speech disorders.

3

Boora, an AI-assisted digital platform for overweight and obesity care in Brazilian primary care: a formative mixed-methods evaluation of perceived usability and acceptability

Couto, F. d. F. S.; Almeida, C. P. B.

2026-07-16 primary care research 10.64898/2026.07.15.26358116 medRxiv

Top 0.7%

0.5%

Objective. To evaluate the perceived usability, acceptability, and user experience (rather than the clinical effectiveness) of Boora, an AI-assisted, human-supervised digital platform prototype for longitudinal overweight and obesity care, among users and health professionals in Brazilian primary care. Design. Convergent mixed-methods formative evaluation. Perceived usability was measured with the System Usability Scale (SUS) and summarised descriptively; semi-structured interviews conducted after hands-on use were analysed with codebook thematic analysis (Braun and Clarke); the two strands were integrated through a joint display. Qualitative reporting followed the Consolidated Criteria for Reporting Qualitative Research (COREQ). Setting. Primary health care network of Ananindeua, Para, within the Brazilian Unified Health System (January to February 2026). Participants. Fifteen adults with overweight or obesity (BMI at least 25 kg/m², confirmed via electronic health records) who used the patient application on their own smartphones for 24 hours, and eight primary care professionals (nurses, physicians, and a dietitian) who used the professional dashboard for approximately 20 minutes on predefined tasks with synthetic data. Main outcome measures. SUS scores and qualitative themes addressing usability, acceptability, perceived usefulness, barriers, and perceived clinical and workflow fit. Results. Boora showed good perceived usability in both cohorts (users mean 76.5, SD 10.3; professionals mean 77.5, SD 4.6; both above the SUS normative average of 68). Four themes emerged per cohort. Users valued an accessible interface and visible progress but described daily logging burden, fragile anticipated engagement, and digital-literacy and accessibility barriers. Professionals valued a clear interface and the prospect of panel-managed, proactive follow-up, while requiring training, AI governance, protected time, and interoperability with the national record. Integration indicated that the disengagement users anticipated was the risk professionals perceived the dashboard could help identify, whereas the educational AI assistant was the weakest and most ambiguous component for both groups. Conclusions. Boora was perceived as usable and acceptable, with perceived value concentrated in human-supervised, longitudinal follow-up rather than autonomous self-tracking or AI advice. These findings concern perceived usability and acceptability, not clinical effectiveness or sustained engagement. Real-world adoption would depend on accessibility refinements, electronic-record integration, and clear AI governance aligned with the principles of Brazil's proposed risk-based AI framework and the LGPD.
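
For readers unfamiliar with how a SUS value such as the reported means of 76.5 and 77.5 is obtained, the standard scoring rule for a single respondent can be sketched as follows. The function name and the example responses are illustrative assumptions and do not come from the study; the abstract reports only cohort-level means and standard deviations.

```python
def sus_score(responses):
    """Score one completed SUS questionnaire (ten items, each rated 1 to 5).

    Standard SUS rule: odd-numbered items contribute (rating - 1),
    even-numbered items contribute (5 - rating), and the sum is
    multiplied by 2.5 to give a 0-100 score.
    """
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs ten responses, each an integer from 1 to 5")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += rating - 1   # positively worded item
        else:
            total += 5 - rating   # negatively worded item
    return total * 2.5


# Hypothetical respondent, not study data.
example = [4, 2, 4, 2, 4, 2, 4, 2, 4, 2]
print(sus_score(example))
```

A cohort mean like those in the abstract would then be the average of these per-respondent scores.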

4

Validity and Reliability of the Novel Indonesian Instrument for Aphasia Diagnosis (IDEA)

Prawiroharjo, P.; Fakhri, A.; Gabrielle, A.; Martalia, V.; Rahmayani, S. A.; Wijaya, V. G.

2026-07-19 neurology 10.64898/2026.07.17.26358303 medRxiv

Top 0.8%

0.4%

Aphasia diagnosis in Indonesia remains challenging due to limited culturally and linguistically appropriate instruments. Widely used tools such as the Boston Diagnostic Aphasia Examination (BDAE) and Western Aphasia Battery (WAB) are not adapted to the Indonesian context, while Tes Afasia untuk Diagnosis, Informasi, dan Rehabilitasi (TADIR) provides screening but lacks diagnostic accuracy. To address this gap, we developed the Instrumen Diagnosis dan Evaluasi Afasia (IDEA) for native Indonesian speakers and evaluated its validity, reliability, and normative cutoff values in cognitively healthy Indonesian adults. Eighty-three cognitively normal adults (screened using MoCA-Ina) with no history of neurological disease were assessed using IDEA, which evaluates six language domains. Items were adapted from existing tools and reviewed by experts. Content validity, internal consistency (Cronbach's alpha), and construct validity (Exploratory Factor Analysis) were analyzed using SPSS v25. A total of 83 participants were included (median age = 55.81 years, 54% secondary education). IDEA demonstrated good feasibility, with an average completion time of 45-60 minutes depending on participant engagement. Content validity was established by unanimous expert consensus. Construct validity showed meritorious sampling adequacy (KMO = .872) and significant sphericity (Bartlett's test χ²(15) = 278.523, p < .001), supporting factor analysis. Internal consistency showed good reliability across six domains (Cronbach's α = 0.896). IDEA is a valid and reliable tool for assessing aphasia in Indonesian natives. It is a culturally appropriate assessment tool which offers structured, domain-based evaluation and supports differential diagnosis of both classical and progressive aphasia syndromes. Keywords: Aphasia, Language Assessment, Indonesian, IDEA, Validity
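
The internal-consistency figure above is a Cronbach's alpha computed across the six domain scores. As a minimal sketch of the textbook formula, alpha = k/(k-1) × (1 − Σ item variances / variance of total scores), the following uses a hypothetical `cronbach_alpha` helper and made-up data; the authors used SPSS v25, and none of the numbers below are from the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    `scores` is a list of rows, one row per participant, one column per
    item (here: one column per language domain). Sample variances are used.
    """
    n_items = len(scores[0])
    if len(scores) < 2 or n_items < 2:
        raise ValueError("need at least two participants and two items")
    if any(len(row) != n_items for row in scores):
        raise ValueError("every participant needs a score for every item")

    item_variances = [variance([row[j] for row in scores]) for j in range(n_items)]
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores for four participants on three domains (not study data).
toy_scores = [
    [8, 7, 9],
    [6, 6, 7],
    [9, 9, 9],
    [5, 4, 6],
]
print(round(cronbach_alpha(toy_scores), 3))
```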

5

Patient-Specific EEG Baseline Establishment Using the E-norms Method for Pediatric Seizure Detection Without Labeled Training Data

Jabre, J. F.

2026-07-16 neurology 10.64898/2026.07.13.26357876 medRxiv

Top 0.8%

0.4%

The aim of this work is to validate patient-specific EEG baseline establishment using the e-norms method as a screening and retrospective-review tool for seizure detection in pediatric epilepsy. The method was applied to 247 seizure-free EEG recordings (263.92 hours) from 10 patients in the CHB-MIT Scalp EEG Database (ages 3-18). A composite stability metric combining first-derivative dynamics, spectral entropy, variance, and line length was computed per 2-second epoch across 23 channels. Patient-specific detection thresholds were derived from each patient's seizure-free baseline using a weighted statistical procedure. Performance was validated against 72 expert-annotated seizures (2,705 epochs) across 62 seizure files, with durations spanning 6 to 264 seconds (44-fold range). The results show that detection achieved 94.4% event-level sensitivity (68 of 72 seizures; 95% CI 86.6-97.8%) and 81.5% epoch-level sensitivity (2,204 of 2,705 epochs; 95% CI 80.0-82.9%). Eight of ten patients achieved 100% event-level sensitivity with epoch-level sensitivity ranging from 58.7% to 100.0%. Two patients showed partial event-level failures (CHB-15: 17 of 20; CHB-18: 5 of 6), with the four missed events attributable to two characterizable failure modes. Patient-specific thresholds ranged from 4.06 to 4.81 (mean 4.51 +/- 0.25); threshold variation did not correlate reliably with age or sex, confirming that no universal threshold could achieve comparable performance. Detection margins ranged from 0.88 to 1.24 times. Patient-specific e-norms achieves 94.4% event-level sensitivity for pediatric EEG seizure detection without requiring labeled seizure training data, exceeding published human expert inter-rater agreement (50-76%) and recent automated approaches in adult cohorts using behind-the-ear EEG and wearable ECG. Two characterizable failure modes account for the four missed events and inform appropriate clinical use. As a high-sensitivity screening tool complementary to real-time alarm systems, the method is ready for adult validation, prospective deployment, and head-to-head benchmarking.
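
The abstract does not say which method produced its confidence intervals. As an assumption, the sketch below uses the Wilson score interval, which happens to reproduce the reported bounds to one decimal place for both 68 of 72 events and 2,204 of 2,705 epochs. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion.

    Assumption: the abstract does not name its interval method; Wilson is
    used here only because it is a common choice for sensitivities.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Counts taken from the abstract.
for label, hits, total in [("event-level", 68, 72), ("epoch-level", 2204, 2705)]:
    low, high = wilson_interval(hits, total)
    print(f"{label}: {hits / total:.1%} (95% CI {low:.1%} to {high:.1%})")
```

With only 72 events the interval is roughly eleven percentage points wide, which is worth keeping in mind when comparing the event-level figure against other methods.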

6

Automated Detection of Motor Speech Disorders and Subtype Classification

Wang, F.; Utianski, R. L.; Barnard, L. R.; Stricker, J. L.; Clark, H. M.; Meade, G. F.; Jones, D. T.; Whitwell, J. L.; Josephs, K. A.; Duffy, J. R.; Botha, H.

2026-07-19 neurology 10.64898/2026.07.16.26358268 medRxiv

Top 0.8%

0.4%

Motor speech disorders (MSDs) are early markers of neurological disease, but expert perceptual analysis is rarely available outside specialized centers. Automated speech analysis offers a scalable alternative, yet prior studies have not systematically compared modeling approaches or assessed clinically relevant metrics in independent datasets. This study compared static acoustic features, articulatory-informed Phonet features, and self-supervised pretrained models for binary and multi-label MSD classification. We trained and evaluated models on 583 speech samples using speaker-level splits. Baseline models included logistic regression and Gated Recurrent Units (GRUs) trained on eGeMAPS and MFCCs. We extracted three types of Phonet-derived features and evaluated pretrained HuBERT and SSAST models in frozen, partially fine-tuned, and fully fine-tuned configurations. Binary classification distinguished MSDs from controls, while multi-label classification identified six MSD subtypes. Models were assessed using validation AUC, and cut points were tested on two independent datasets. Pretrained and Phonet-based models substantially outperformed static acoustic features. In binary classification, HuBERT achieved the highest AUC (0.95), while compact Phonet-derived GRUs achieved comparable performance (up to 0.94). These models generalized well to independent datasets, maintaining high sensitivity (0.94) and specificity (0.97). In multi-label classification, Phonet models achieved the highest macro-average AUC (0.86), but threshold-based subtype performance declined on unseen data. Automated MSD detection is feasible and clinically promising. Binary classification generalized well, whereas multi-label classification showed limited threshold stability across datasets.

7

How Do Nurses Make Clinical Decisions Via Remote Reviews: A Convergent Mixed-Methods Study

Zhang, Y.; Sutherland, S.; Greenway, K.; Stayt, L.

2026-07-17 nursing 10.64898/2026.07.15.26357946 medRxiv

Top 1%

0.2%

Background: Remote clinical reviews have become an integral component of contemporary nursing practice across community and acute care settings. Nurses increasingly make autonomous clinical decisions using telephone, video, and online/digital systems, often with limited sensory information and under conditions of uncertainty. However, empirical understanding of how nurses make clinical decisions via remote reviews remains limited. Aim: To explore and understand how registered nurses (RNs) make clinical decisions about patient care via remote reviews. Methods: A convergent mixed-methods design was employed. Quantitative data (analytic quantitative sample N=53) were collected using validated questionnaires that measured decision-making processes, physician-nurse collaboration, decision-making stress, and perceived decision-making ability. Qualitative data (N=23) were generated through semi-structured interviews. Data collection took place between October 2024 and April 2025. Quantitative data were analysed using descriptive statistics, correlation, and multiple regression. Qualitative data were analysed using framework analysis. Integration was achieved through pillar-building and theory-driven synthesis and illustrated by joint display tables. Results: Most nurses demonstrated a flexible decision-making style, integrating analytical and intuitive reasoning. Both analytical and intuitive processes were positively associated with perceived decision-making ability. Physician-nurse collaboration emerged as a strong predictor of decision-making confidence, while decision-related stress was not a significant predictor. Qualitative findings identified three themes: characteristics of remote review; making adaptive decisions shaped by both internal and external constraints and enablers; and external influencing factors. The integrated findings informed a theory-informed ICE framework to illustrate how nurses make clinical decisions via remote reviews. Conclusion: Remote clinical decision-making is a dynamic cognitive-environmental process rather than a purely individual cognitive act. The ICE framework conceptualises this interaction, extending existing decision-making theories to digitally mediated care. Impact: Understanding remote decision-making supports training design, clinical governance, and the development of Artificial Intelligence-enhanced decision-support tools grounded in ecological bounded rationality. Patient or Public Contribution: Patient and public representatives contributed to stakeholder discussions that informed the development of the interview topic guide and the theoretical model. Patients or members of the public were not involved in recruitment, data collection, analysis, interpretation of findings, or preparation of the manuscript. Keywords: clinical decision-making, remote reviews, telehealth, nursing, mixed methods, ecological bounded rationality

8

Benchmarking Speech Recognition Models for Medical Consultations in Latin American Spanish: A Comparative Evaluation with Fine-Tuning

Carrillo, R. M.; Carbajal Serrano, A.; Condori Pinedo, P. S.

2026-07-16 public and global health 10.64898/2026.07.14.26358062 medRxiv

Top 1%

0.2%

BACKGROUND: Artificial intelligence (AI) medical scribes rely on speech-to-text (STT) models for transcription. Evaluations of STT models in non-English settings remain scarce. We benchmarked ten STT models on medical consultations in Latin American (LatAm) Spanish and assessed whether fine-tuning improves transcription accuracy. METHODS: We used ten YouTube videos depicting medical consultations, with human transcriptions as the ground truth. Five open-source models were evaluated: Whisper Large, Whisper Large v3, Whisper Large v3 Turbo, Voxtral Mini 3B, and Canary 1B v2; and so were five closed-source models: gpt-4o-transcribe, gpt-4o-mini-transcribe, gemini-2.5-pro, Eleven Labs, and Assembly AI. Whisper Large v3 was fine-tuned. One video was withheld from training. Performance was assessed using Word Error Rate (WER), Character Error Rate (CER), BLEU Score, ROUGE-L, BERT Score, and Semantic Similarity on the one withheld video. RESULTS: None of the fine-tuning iterations outperformed the vanilla Whisper Large v3. On the withheld video, Gemini-2.5-pro was the closed-source model with the best performance in four of six metrics. In comparison to the closed-source models, the fine-tuned model never outperformed the other models (withheld video); conversely, in comparison to the open-source models, the fine-tuned model showed better performance across metrics, for instance: BLEU score (63% vs 58% for the second-ranking model), BERT (89% vs 86%), semantic similarity (89% vs 83%), and CER (19% vs 20%). CONCLUSIONS: Whisper Large v3 and its fine-tuned variant are the best open-source STT models for transcribing medical conversations in LatAm Spanish. These findings provide an evidence base for developing AI medical scribes tailored to Spanish-speaking LatAm.
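
Word Error Rate, the first metric listed above, is conventionally the word-level edit distance between the reference and the model transcript, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the authors' evaluation code. The lowercase-and-split normalisation and the example sentences are assumptions; published benchmarks usually apply a more careful text normaliser.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count.

    Computed with a word-level Levenshtein distance. Normalisation here is
    only lowercasing and whitespace splitting (an assumption for brevity).
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# Hypothetical transcripts, not taken from the benchmark videos.
reference = "el paciente tiene fiebre"
hypothesis = "el paciente tiene tos"
print(word_error_rate(reference, hypothesis))
```

Character Error Rate follows the same recurrence with characters in place of words.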

9

Transducin: an open-source pipeline recovering SNOMED-CT coded measurements from the undocumented Optopol .OPT and Zeiss Cirrus private-tag formats as DICOM Structured Reports

Jaurrieta Hinojos, J. N.; Palomares Ordonez, J. L.; Chacon Hinojos, J. F.; Folgueras Batres, M. A.

2026-07-17 ophthalmology 10.64898/2026.07.14.26357256 medRxiv

Top 2%

0.1%

Background. Quantitative optical coherence tomography (OCT) measurements are essential for retinal disease monitoring, yet leading vendors store acquisition data in undocumented proprietary formats or encode measurements exclusively in private DICOM tags inaccessible to open systems. Methods. We present Transducin, an open-source Python library that reverse-engineers the undocumented Optopol Revo FC130 and Revo 60 .OPT binary format and extracts quantitative measurements from Zeiss Cirrus HD-OCT private DICOM tags, generating TID 1500 Structured Reports with SNOMED-CT coded findings for both platforms. A novel finding, that OCTPARAMS tag 23 encodes ocular laterality through the arithmetic sign of the foveal horizontal position, enables geometry-based laterality inference requiring no operator data entry, validated across 18 files from two device models and four software versions with 100% accuracy. Results. The primary corpus of 452 Optopol .OPT files (73 patients, 7 acquisition types) was parsed with 100% success. Cross-version compatibility was confirmed across SOCT versions 11.5.0 through 21.5.0, spanning approximately eight years of software development. The Zeiss Cirrus pipeline generated TID 1500 SRs for all 41 applicable studies (100%), yielding CMT 203 to 630 µm and RNFL 53 to 123 µm across a clinically representative range. Conclusions. Transducin provides the first publicly documented specification of the Optopol .OPT format and the first open-source multivendor pipeline generating SNOMED-CT coded DICOM Structured Reports from both Optopol Revo and Zeiss Cirrus devices, closing a gap explicitly confirmed by both manufacturers' own documentation. The code is available at https://github.com/oftalmos-org/transducin (Apache License 2.0).

10

Developing and Prospectively Validating a Reproducible Graph Representation Specification for Clinical Guideline Algorithms: The Measurement Foundation of the Clinical Guideline Complexity Index

Milani, R. V.; Bober, R. M.

2026-07-20 health informatics 10.64898/2026.07.17.26358358 medRxiv

Top 2%

0.1%

Background. Translating a clinical guideline decision algorithm into a computational graph requires judgment, and unconstrained coding yields divergent graphs; any complexity measure computed from such a graph inherits that variation, so its reproducibility must be demonstrated rather than assumed. Objective. To develop, and prospectively test, an empirical method for making graph extraction reproducible, using the Clinical Guideline Complexity Index (CGCI) and four guideline algorithms as a case study. Methods. We built a Graph Representation Specification (an ontology, a motif catalogue, disambiguation conventions, decomposition rules, a deterministic validator, and a scoring engine) and refined it by error-driven grammar induction: measure inter-coder disagreement, localize its dominant class, induce a single grammar rule, and prospectively test whether that rule improves agreement in the anticipated class. Reproducibility was quantified with a pre-specified, topology-based endpoint (Decision Topology Agreement) rather than edge agreement, which is oversensitive to representational choices that do not affect the score. Two trained coders independently coded the diabetes, dyslipidemia, heart-failure, and hypertension algorithms. Results. A rule induced from the diabetes comorbidity panel (assessment topology) generated a pre-specified prediction that heart-failure figures, sharing the same motif, would converge; on a fresh, independently coded pair they did, with an absolute CGCI difference of approximately one. Decision topology reproduced closely (decision-order agreement at or near 1.00 for three of four guidelines), while breadth counting was rule-sensitive: an explicit modifier-counting rule reduced the largest disagreement from 27 to 4 tokens. Residual disagreement was bounded and localizable to specific, nameable representational choices. Conclusions. Graph-extraction reproducibility can be systematically improved through iterative grammar refinement, and a prospectively derived rule can be confirmed to improve agreement. These results establish the measurement foundation (reliability, not construct validity) for a companion study interpreting CGCI as cognitive load, and the method may apply wherever graphs are extracted from structured source artifacts.

11

Design tensions in a two-sided marketplace for reusable digital therapeutics software components: a qualitative interview study

Kowatsch, T.; Melamed, S.; Nissen, M.; Merz, Y.

2026-07-20 health informatics 10.64898/2026.07.17.26358332 medRxiv

Top 2%

0.1%

Objectives: To identify stakeholder-perceived design tensions in a two-sided marketplace for reusable digital therapeutics (DTx) software components and to use these tensions to propose alternative marketplace concepts. Methods: We conducted 24 semi-structured interviews with digital health researchers and professionals. Data were analysed using hybrid deductive-inductive codebook thematic analysis. The Magic Triangle provided the initial deductive structure. One researcher coded all transcripts; a second independently applied the developing codebook to five transcripts to refine definitions and consistency. Seventeen parent themes were synthesized into 12 design tensions, which informed three author-generated marketplace concepts. Results: Participants described trade-offs concerning target users and host, component scope and customization, quality labels, verification, geographic scope, pricing, interoperability, platform launch, risks and market niche. The resulting concepts emphasized a regional startup ecosystem, a research-oriented hybrid marketplace or a global marketplace with stricter entry requirements. Discussion: The concepts combine the tensions in different ways and highlight competing priorities in governance, openness, assurance, scalability and early platform growth. Conclusion: Stakeholders identified recurring design choices for a DTx software-component marketplace. The concepts provide hypotheses for prototyping and evaluation; the study did not test technical feasibility, market demand, regulatory acceptability or effects on development cost or time.

12

Bowel Irrigation Questionnaire: Development of a Patient-Reported Experience Measure to assess the user experience of Transanal Irrigation

Farrow, E.; Balachandran, R.; Embleton, R.; Krogh, K.; Vollebregt, P. F.; Cornish, J.; Christensen, P.

2026-07-17 surgery 10.64898/2026.07.16.26358225 medRxiv

Top 2%

0.1%

Aims: To develop the Bowel Irrigation Questionnaire (BIQ), a patient-reported experience measure (PREM) designed to assess the user experience of transanal irrigation (TAI). Methods: Statements were generated through literature review and qualitative interviews with healthcare professionals (HCPs) and product users. Statements were rated on a 6-point content validity index scale through an international three-round online Delphi survey by 20 expert panel members. Consensus attainment was defined based on percentage agreement; statements that did not meet consensus were discussed at a final international online consensus meeting. The content validity of the PREM was evaluated through cognitive interviews and the Questionnaire on Questionnaires (QQ-10). Reliability was assessed using a test-retest design, where users completed the BIQ on two occasions one week apart. Results: A total of 215 statements were generated from 9 multi-disciplinary qualitative interviews and literature review. Statements were refined to reduce repetition and ensure clarity. Seventy-three statements grouped into 11 domains were reviewed through the Delphi survey. Following the Delphi survey and clinical consensus meeting, the preliminary BIQ consisted of 15 items. Six cognitive interviews were conducted, resulting in a finalised BIQ of 16 items. Thirty-two product users completed both the QQ-10 and test-retest study, the results of which demonstrated good content validity and temporal stability respectively. Conclusions: The Bowel Irrigation Questionnaire is a novel PREM designed to assess the user experience of TAI in both clinical and research settings. The instrument demonstrates good validity, acceptability and temporal stability, supporting its use as a reliable measure of patient experience.

13

Machine learning and data-driven models for predicting post-stroke dysphagia: a systematic review and meta-analysis

Mohammadi Yazdi, S.; Motevaselian, M.; Khatami, S.; Radfar, N.; Jourahmad, Z.; Perez, H. A.

2026-07-17 neurology 10.64898/2026.07.15.26358113 medRxiv

Top 2%

0.1%

Background: Post-stroke dysphagia (PSD) contributes to aspiration, pneumonia, malnutrition, prolonged hospitalization and mortality. We evaluated the discrimination, validity and readiness of machine learning and data-driven prediction models for PSD-related outcomes. Methods: Following a prospectively registered protocol (PROSPERO CRD420261419259), we searched PubMed/MEDLINE, Embase, Web of Science Core Collection, CINAHL and CENTRAL from inception through June 7, 2026. Eligible studies developed or validated multivariable prediction models for PSD-related outcomes in adults with stroke. We used PROBAST and PROBAST+AI to assess risk of bias and applicability and TRIPOD+AI to evaluate reporting. Area under the curve (AUC) estimates were pooled on the logit scale with random-effects models. Results: Twenty-four studies were included and ten contributed to meta-analysis. Four studies predicting early or incident PSD yielded a pooled AUC of 0.94 (95% CI 0.60-0.99; I2 = 95.6%). Pooled AUCs were 0.84 (95% CI 0.71-0.92) for aspiration or penetration-aspiration and 0.89 (95% CI 0.24-1.00) for severe dysphagia. The exploratory analysis of all ten risk-prediction models produced an AUC of 0.90 (95% CI 0.80-0.95), but heterogeneity was substantial (I2 = 90.3%) and the prediction interval was 0.51-0.99. Every study had high risk of bias because of analysis-domain concerns; calibration and external validation were uncommon. Conclusions: Reported discrimination was often high, but the evidence does not establish reliable performance in care. Independent validation, calibration, complete model reporting and clinical-impact studies are needed before these models guide post-stroke swallowing care. Keywords: Post-stroke dysphagia; Stroke; Deglutition disorders; Machine learning; Clinical prediction model; Area under the curve; Meta-analysis
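
The pooling step described in the Methods (AUCs combined on the logit scale with a random-effects model) can be sketched as follows. The abstract does not name the between-study variance estimator, so DerSimonian-Laird is assumed here. The function name, the example AUCs, and the logit-scale standard errors are all hypothetical and are not extracted from the included studies.

```python
import math


def pool_auc_random_effects(aucs, logit_standard_errors):
    """Pool AUCs on the logit scale with a DerSimonian-Laird random-effects model.

    Returns (pooled AUC, lower 95% bound, upper 95% bound, I-squared).
    Assumption: DerSimonian-Laird is one common estimator; the review may
    have used a different one.
    """
    if len(aucs) != len(logit_standard_errors) or len(aucs) < 2:
        raise ValueError("need matching inputs for at least two studies")
    if any(not 0 < a < 1 for a in aucs):
        raise ValueError("each AUC must lie strictly between 0 and 1")

    effects = [math.log(a / (1 - a)) for a in aucs]        # logit transform
    variances = [se ** 2 for se in logit_standard_errors]
    weights = [1 / v for v in variances]                   # fixed-effect weights
    weight_sum = sum(weights)
    fixed_mean = sum(w * y for w, y in zip(weights, effects)) / weight_sum

    # Cochran's Q and the DerSimonian-Laird between-study variance tau^2.
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = weight_sum - sum(w ** 2 for w in weights) / weight_sum
    tau_squared = max(0.0, (q - df) / c) if c > 0 else 0.0
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0

    re_weights = [1 / (v + tau_squared) for v in variances]
    pooled_logit = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    pooled_se = math.sqrt(1 / sum(re_weights))

    def inverse_logit(x):
        return 1 / (1 + math.exp(-x))

    return (
        inverse_logit(pooled_logit),
        inverse_logit(pooled_logit - 1.96 * pooled_se),
        inverse_logit(pooled_logit + 1.96 * pooled_se),
        i_squared,
    )


# Hypothetical inputs, for illustration only.
pooled, low, high, i2 = pool_auc_random_effects([0.82, 0.90, 0.95], [0.15, 0.20, 0.25])
print(f"pooled AUC {pooled:.2f} (95% CI {low:.2f} to {high:.2f}), I2 = {i2:.0%}")
```

Back-transforming from the logit scale keeps the interval inside (0, 1), which is why very wide intervals such as 0.24-1.00 can appear when only a few heterogeneous studies contribute.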

14

Effects of AI-driven Lifestyle Intervention on Psychological Well-Being and Body Image Among Young Adults In Malaysia

Najwa, A.; Azmi, I.; Zafran, A.; Adibah, N.; Zulkafli, H.; Iman, A.; Linoby, A.

2026-07-21 nutrition 10.64898/2026.07.20.26358442 medRxiv

Top 2%

0.1%

Background: University students experience substantial psychological well-being and body-image concerns, while scalable, personalized digital support remains underexamined in Malaysia. Artificial intelligence chatbots may deliver repeated lifestyle guidance, but the incremental value of personalization over structured chatbot support is uncertain. Objectives: This study evaluated changes in psychological well-being and body appreciation following a 12 week personalized AI-powered lifestyle intervention, NExGEN, among Malaysian university students. Methods: A two-arm, controlled, quasi-experimental pre-post study allocated 140 students aged 18 to 35 years by matched blocks to NExGEN (n = 70) or a structured-prompt ChatGPT control (n = 70). NExGEN generated adaptive weekly lifestyle actions from a 47-item onboarding assessment, whereas control participants received standardized weekly prompts covering the same lifestyle domains. Psychological well-being and body appreciation were assessed at baseline and week 12 using the World Health Organization-Five Well-Being Index and Body Appreciation Scale-2. Intention-to-treat linear mixed models estimated adjusted within-group changes and between-group differences in change, with Holm adjustment for the co-primary outcomes. Results: Week-12 assessments were completed by 121 participants (86.43%). In NExGEN, psychological well-being improved by an adjusted 8.68 points (95% CI, 6.22 to 11.14), z = 6.91, p < .001, and body appreciation improved by 0.17 points (95% CI, 0.10 to 0.24), z = 4.82, p < .001. However, between-group differences in change were not statistically significant for psychological well-being (2.87 points; 95% CI, -0.48 to 6.23; z = 1.68; Holm-adjusted p = .093) or body appreciation (0.10 points; 95% CI, 0.00 to 0.19; z = 1.99; Holm-adjusted p = .093). Median platform logins were 68.00 in NExGEN and 58.50 in control; mean acceptability scores were 3.92 and 3.59, respectively. 
Conclusions: NExGEN participation was associated with significant within-group improvements in psychological well-being and body appreciation, but personalized guidance did not demonstrate superiority over structured chatbot guidance. Because allocation was quasi-experimental, causal attribution remains limited. Randomized component-level trials are needed to determine whether personalization provides incremental benefit.
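
The Holm adjustment mentioned in the abstract above can be illustrated with a minimal Python sketch. It assumes the reported z statistics (1.68 and 1.99) are referred to a standard normal distribution with two-sided tests; the authors' exact computation is not stated, and the reported z values are rounded, so this only shows how two different raw p-values can end up with the same adjusted value. The function names `two_sided_p` and `holm_adjust` are illustrative, not from the study.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# z statistics as reported in the abstract (rounded values).
z_scores = {"well_being": 1.68, "body_appreciation": 1.99}
raw = {name: two_sided_p(z) for name, z in z_scores.items()}
adj = dict(zip(raw, holm_adjust(list(raw.values()))))

for name in raw:
    print(f"{name}: raw p = {raw[name]:.3f}, Holm-adjusted p = {adj[name]:.3f}")
```

Under these assumptions the smaller raw p-value (body appreciation) is doubled and the larger one is carried up to at least that value, so both adjusted p-values land near .093, in line with the abstract.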

15

Validation of an Assessment Scale for a Low-Tech Laparoscopic Appendectomy Simulation and Its Relevance for Formative Self-Assessment

Tumameu Kouam, T. H.; Renoult, L.; Poitevin, M.; Jourdin, L.; Herve, C.; Meignan, P.; Podevin, G.; Schmitt, F.

2026-07-21 medical education 10.64898/2026.07.20.26358477 medRxiv

Top 2%

0.1%

Introduction: Laparoscopic appendectomy is an ideal procedure for acquiring laparoscopic skills through simulation. However, providing constructive feedback during technical training is time consuming for surgical trainers, a burden that could be eased by validated tools that enable appropriate formative self-assessment. For this reason, we developed a structured assessment scale for a laparoscopic appendectomy exercise using a low-fidelity simulator. The objective of this study was to validate the scale for use in formative self-assessment. Methods: During laparoscopic simulation sessions in 2025-2026, participants with varying levels of experience performed a standardized laparoscopic appendectomy (LAP) exercise on a low-fidelity simulator. Performance was assessed through formative self- and external assessment using a specific scale derived from the OSATS (Objective Structured Assessment of Technical Skills) score. Content and construct validity, internal consistency, reproducibility, and reliability in both hetero- and self-assessment were analyzed. Results: Thirty-two participants were included in the validation study of the LAP scale, including 7 medical students, 17 residents in pediatric, visceral, urological, and gynecological surgery, and 8 practicing surgeons. The content of the scale was deemed relevant by 80% of the users. It demonstrated excellent construct validity, with scores increasing according to level of experience: 9.9 ± 0.7 among students, 12.7 ± 3.3 among junior residents, 16.6 ± 3.3 among experienced residents, and 18.8 ± 0.9 among practicing surgeons (p < 0.0001). Reproducibility and internal consistency were significant, while inter- and intra-rater reliability were excellent (correlation coefficients r = 0.90 and 0.91; p < 0.0001), as was the correlation between external and self-assessment (r = 0.81; p < 0.0001). Self-assessment was more reliable among experienced learners than among novices.
Conclusion: This standardized LAP scale is validated for both external and self-assessment, the latter requiring prior training to be reliable and formative.

16

From Menarche to Menopause: Hormonal Influences on Functional Neurological Disorder

Palmer, D. D. G.; Warren, N.; Morton, A.; Lehn, A.

2026-07-18 neurology 10.64898/2026.07.16.26358260 medRxiv

Top 2%

0.1%

Background Functional neurological disorder (FND), one of the most common neurological conditions, affects women almost twice as frequently as men. The reasons for this are unknown, and there has been minimal research into how physiological and pathological features of women's health interact with symptoms of FND. Methods We conducted an online survey assessing the effect of several aspects of women's health on the severity of symptoms of FND. Results 484 people completed the survey. Among the 223 who had regular or fairly regular menstrual cycles, a strong difference across the menstrual cycle was seen, with symptoms at their best in the follicular phase, worsening in the luteal phase, and worst in the pre-menstrual period and the menses. This effect was not moderated by a proxy measure of pre-menstrual dysphoric disorder (PMDD). Participants who were taking the combined oral contraceptive (COC, n=43) or progesterone-based contraception (n=80) were more likely to report symptom improvement than worsening after starting the medication. When compared to menstruating participants who were not taking the COC, participants taking the COC reported less worsening in their symptoms of FND in the luteal, pre-menstrual, and menstrual phases. Of the 99 women who had passed menopause since developing FND, 76% reported worsening of their FND symptoms after menopause. Discussion This study demonstrates interactions between several aspects of women's health and symptoms of FND. The observed pattern of symptom fluctuation across hormonal states suggests a potential modulatory role of oestrogen, warranting further targeted investigation.

17

Diagnostic analytics of routine Clinical Competency Committee data of six cohorts in family medicine program in the UAE, utilizing Milestones, EPA, and ITE

Baynouna Alketbi, L. M.; Nagelkerke, N.; Alzarouni, A.; AlKwuiti, M.

2026-07-16 medical education 10.64898/2026.07.13.26356644 medRxiv

Top 2%

0.1%

In competency-based medical education (CBME), longitudinal data are generated continuously. The judgments a Clinical Competency Committee (CCC) makes about trainee learning and performance are a valuable resource, supporting both resident and program development. Such data can also enable the evaluation of rating quality and of CBME instruments such as Milestones and Entrustable Professional Activities (EPAs), which can help address a gap in the CBME literature, where evidence on the performance of these instruments remains limited. Objective Routinely gathered CCC data of six cohorts in a four-year ACGME-I-accredited family medicine residency in Al Ain, United Arab Emirates, were studied to describe growth trajectories, rating-system behavior, and the concurrent agreement of CBME instruments, and to investigate the prospective predictive validity of two CBME instruments, EPA and Milestones, and the In-Training Exam (ITE). Methods The longitudinal CCC data for 80 residents across six cohorts (2019-20 to 2024-25) were assessed at up to eight time points (mid- and end-year; R1-R4). The pooled dataset included 10,458 EPA item ratings across 334 resident-time points, 5,021 Competency Milestone item ratings across 285 resident-time points, and 185 ITE scores. Five research questions were examined: growth trajectories; within- and between-resident variation and straight-lining (identical scores on assessed items at a single time point); EPA-Milestones agreement; the validity of supervisor ratings against the ITE (anchoring diagnostic, same-year correlations, prospective regressions); and EPA blueprint fidelity (the mapping of EPAs against the ACGME-I subcompetencies). Al Ain trajectories were benchmarked against an international family medicine reference. Results All three instruments rose steadily across the eight time points. By End R4, the Milestones mean (4.00, range 3.83-4.24) matched US end-of-training norms (3.84-4.02).
With regard to rating quality, pooled R1-R3 Milestones straight-lining was 2.3% (EPA 0%), below US benchmarks; between-resident discrimination was preserved (SD 0.41-0.54); and longitudinal halo was ruled out (within-domain growth-slope r = 0.61 vs across-domain r = 0.37). End R1 Overall EPA was the strongest prospective predictor of Final Competency (B = 0.96, p < 0.001) and Final ITE (B = 96.88, p = .006). Medical Knowledge ratings were independent of prior ITE scores from Mid R2 onward, and End R2 MK ratings predicted ITE 17 months later at r = 0.88, confirming supervisor judgment was not anchored to test results. As to whether individual EPAs correlate with individual Milestone subcompetencies at each time point, significant EPA-Milestones correlations were negligible at End R1 (1 of 222 item-level cells significant) and converged by End R3 (36 cells), while resident-mean stepwise regressions showed the two instruments (EPA and Milestones) behaved as overlapping predictors throughout, indicating that EPAs and Milestones are complementary at the level of specific content but convergent at the level of aggregate resident judgment. Blueprint fidelity rose from 30% of cells reaching r ≥ 0.40 at End R2 to 80% at End R3 in the same cohort, indicating that apparent fidelity is materially affected by measurement timing. Conclusion By graduation, residents demonstrated substantial and progressive competency achievement across both instruments, with the majority reaching the entrustable threshold on both EPA and Milestone ratings. The rating system demonstrated disciplined assessment behavior by supervisors and both concurrent and prospective validity relative to the ITE. Overall EPA at End R1 was the strongest prospective predictor of all three terminal outcomes (final ITE score, graduating Competency Milestones, and graduating overall EPA), outperforming Milestones and baseline knowledge.
Routine CCC data support an evidence-based quality assurance framework spanning rater-process diagnostics, outcome-validity diagnostics, and the asymmetric-instrument diagnostic, requiring no additional data collection beyond existing program processes.
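
The straight-lining diagnostic defined in the abstract above (identical scores on all assessed items at a single time point) can be sketched in a few lines of Python. The record layout, resident labels, and ratings below are hypothetical and only illustrate the definition; they are not data from the study, and the authors' handling of missing items is not described.

```python
def is_straight_lined(item_ratings: list[float]) -> bool:
    """True when more than one item was rated and every rating is identical."""
    return len(item_ratings) > 1 and len(set(item_ratings)) == 1


def straight_lining_rate(records: dict[tuple[str, str], list[float]]) -> float:
    """Share of resident-time points whose item ratings are all identical."""
    if not records:
        return 0.0
    flagged = sum(is_straight_lined(ratings) for ratings in records.values())
    return flagged / len(records)


# Hypothetical item ratings keyed by (resident, time point).
records = {
    ("resident_A", "End R1"): [2.0, 2.5, 2.0, 3.0],
    ("resident_B", "End R1"): [3.0, 3.0, 3.0, 3.0],  # straight-lined
    ("resident_A", "Mid R2"): [3.0, 3.5, 3.0, 3.5],
    ("resident_B", "Mid R2"): [3.5, 3.0, 3.5, 4.0],
}

rate = straight_lining_rate(records)
print(f"Straight-lining rate: {rate:.1%}")
```

Counting at the level of resident-time points matches the unit the abstract uses for its pooled dataset, although whether the reported 2.3% was computed exactly this way is an assumption.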

18

Development of a Functional Needs Assessment Tool to estimate population level functional difficulties and need for services and assistive products

Boggs, D.; Birabwa, A.; Adkins, S.; Atijosan-Ayodele, O.; Bulathwela, S.; de Cates, C.; Foster, A.; Kuper, H.; Holloway, C.; Mugisha, J.; Polack, S.

2026-07-17 public and global health 10.64898/2026.07.10.26357317 medRxiv

Top 2%

0.1%

Background: Globally, at least 2.6 billion people need rehabilitation services and more than 2.5 billion people need assistive technology (AT). However, reliable data are lacking on population-level need for rehabilitation services and assistive products (AP) in different settings for evidence-based policy and programme planning. This first paper describes the development of the Functional Needs Assessment Tool (FNAT), a new survey tool developed between 2018 and 2023 to fill this data gap. Objective: To develop a new multidomain tool to assess population-level functional difficulties and need for services and AP utilising both self-report and clinical assessment methodologies. Development stages: FNAT was developed based upon primary and secondary data analysis, existing survey tools and expert consultation through a series of four steps: Step 1 Inform, Step 2 Build, Step 3 Draft and Step 4 Develop. FNAT uses both self-reported and clinical assessment tools to estimate the prevalence of functional difficulties/impairment and the need for services and AP in the following seven domains: vision, hearing, mobility, communication, cognition, self-care and mental health. It uses a two-stage population-based assessment with data collection through a bespoke tablet-based mobile application and web-based platform. Discussion: FNAT is a new multi-domain modular tool developed to address data gaps by estimating prevalence of functional difficulties and service/AP needs in a population. Potential advantages and disadvantages were highlighted during the development stages, and the tool needs to be pilot tested to assess the feasibility of the methodology and the functionality of the tablet-based mobile data collection application.

19

Validity and Test-Retest Reliability of the Hume Pod Bioimpedance Analyzer for Body Composition Assessment

Tinsley, G. M.; Velasquez, C. M.; Florez, C. M.; Way, A. E.; Sullivan, M. H.; Whitson, J. A.; Rudolph, R. A.; Alexander, J. R.; Malladi, A.

2026-07-20 nutrition 10.64898/2026.07.17.26358337 medRxiv

Top 2%

0.1%

Consumer-grade bioelectrical impedance analyzers have become widely used for body composition assessment, yet their accuracy varies considerably across devices. The Hume Pod is a popular consumer-grade analyzer marketed as being highly accurate, but independent validation is lacking. The purpose of this study was to evaluate the reliability and validity of the Hume Pod relative to both a four-compartment (4C) model and dual-energy X-ray absorptiometry (DXA). Sixty-seven adults (42 females, 25 males; age 37.2 ± 13.5 years, body mass index: 24.6 ± 4.9 kg/m², body fat percentage [BF%]: 26.4 ± 10.2%) completed duplicate Hume Pod assessments alongside DXA and 4C evaluations. Reliability was evaluated using the technical error of measurement (TEM) and intraclass correlation coefficients (ICC). Validity was assessed using equivalence testing, Lin's concordance correlation coefficient (CCC), standard error of the estimate (SEE), Bland-Altman analysis, and additional tests. The Hume Pod demonstrated strong reliability, with ICCs ≥ 0.993 and TEMs of 0.8% for BF% and 0.6 kg for fat mass (FM) and fat-free mass (FFM). Relative to the 4C model, BF%, FM, and FFM estimates were statistically equivalent (all p<0.05), with strong agreement (CCC=0.95-0.98), low SEE values (3.1%, 2.3 kg, and 2.2 kg, respectively), moderate limits of agreement (±6.1%, ±4.5 kg, and ±4.5 kg), and no proportional bias. Compared with DXA, generally strong agreement was also observed. These findings indicate that the Hume Pod demonstrates strong reliability and validity compared with laboratory reference methods for body composition estimation, supporting its potential use as a consumer body composition assessment.
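
Two of the statistics named in the abstract above, the technical error of measurement for duplicate assessments and Bland-Altman limits of agreement, can be sketched as follows. This assumes the common two-trial TEM formula, sqrt(sum(d^2) / 2n), and limits of agreement computed as bias ± 1.96 sample standard deviations of the differences; the abstract does not state which variants the authors used. All values below are hypothetical body fat percentages, not study data.

```python
import math
import statistics


def technical_error_of_measurement(trial_1: list[float], trial_2: list[float]) -> float:
    """Absolute TEM for duplicate measurements: sqrt(sum(d^2) / (2 * n))."""
    if len(trial_1) != len(trial_2) or not trial_1:
        raise ValueError("trials must be non-empty and of equal length")
    squared_diffs = sum((a - b) ** 2 for a, b in zip(trial_1, trial_2))
    return math.sqrt(squared_diffs / (2 * len(trial_1)))


def bland_altman(method: list[float], reference: list[float]) -> tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) for method minus reference."""
    if len(method) != len(reference) or len(method) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [m - r for m, r in zip(method, reference)]
    bias = statistics.mean(diffs)
    half_width = 1.96 * statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    return bias, bias - half_width, bias + half_width


# Hypothetical duplicate device readings and reference values (BF%).
device_trial_1 = [20.0, 25.0, 30.0, 35.0]
device_trial_2 = [21.0, 24.0, 31.0, 34.0]
reference = [19.0, 26.0, 29.0, 36.0]

tem = technical_error_of_measurement(device_trial_1, device_trial_2)
bias, lower, upper = bland_altman(device_trial_1, reference)
print(f"TEM = {tem:.2f} percentage points")
print(f"Bias = {bias:.2f}, limits of agreement = [{lower:.2f}, {upper:.2f}]")
```

TEM summarizes disagreement between repeated readings from the same device, whereas the Bland-Altman limits describe how far a single device reading may fall from the reference method, which is why the abstract reports them under reliability and validity respectively.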

20

Large Language Model - Enhanced Decision Tree Framework for Identifying Multiple Sclerosis Diagnoses from Clinical Documentation

Venkatesh, S.; DelSignore, M.; Wu, X.; Morris, M.; Kerr, W. T.; Visweswaran, S.; Wang, Y.; Xia, Z.

2026-07-17 neurology 10.64898/2026.07.14.26357416 medRxiv

Top 3%

0.1%

Background. Early diagnosis and intervention are crucial in multiple sclerosis (MS), yet diagnostic delays are common. Large language models (LLMs) such as generative pre-trained transformers (GPTs) may help streamline diagnostic workflows by extracting MS diagnostic signals from clinical notes. Objective. To derive MS diagnosis status from the first neurology note using a computable algorithm based on the 2017 McDonald criteria and applying GPT-4 for node-level reasoning within a structured decision framework. Methods. We analyzed first neurology notes from 125 randomly selected patients (including those with MS, related disorders, and controls) enrolled in a clinic cohort between 2017 and 2023. We included the clinical history and diagnostic testing sections but redacted the assessment and plan. We converted the 2017 McDonald criteria into a decision tree and provided expert-curated clinical knowledge to guide GPT-4 reasoning at each decision node. GPT-4 generated binary decisions at each node to traverse the tree and classified MS diagnoses at terminal nodes. We evaluated performance against neurologist-assessed diagnoses and characterized hallucinations (non-factual, incongruent, irrelevant, over-reliant, and logical reasoning errors). Results. In this study cohort (mean age 40 ± 13 years; 81% women) representative of the clinic population, GPT-4 performed well in predicting MS diagnosis (84% accuracy, 79% precision, 74% recall, 91% specificity) using first neurology notes. Hallucinations occurred in 32 cases (26%), most commonly incoherence (75%) and overreliance (47%). Conclusion. A structured, LLM-guided decision framework can flag MS diagnoses from early clinical documentation. Large-scale studies are needed to mitigate hallucinations, validate this approach, and test implementation in clinical settings.
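
The traversal pattern described in the abstract above, a binary decision at each node followed by a classification at a terminal node, might look roughly like the Python sketch below. Everything here is hypothetical: the `Node` structure, the placeholder questions, the terminal labels, and the keyword-matching `stub_decide` function, which merely stands in for a model call. It does not encode the 2017 McDonald criteria and is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """Decision node (question plus yes/no children) or terminal node (label)."""
    question: Optional[str] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    label: Optional[str] = None  # set only on terminal nodes


def classify(
    root: Node, note: str, decide: Callable[[str, str], bool]
) -> tuple[str, list[tuple[str, bool]]]:
    """Walk the tree using one binary decision per node; return label and path."""
    node = root
    path: list[tuple[str, bool]] = []
    while node.label is None:
        if node.question is None or node.yes is None or node.no is None:
            raise ValueError("decision node is missing a question or a child")
        answer = decide(node.question, note)
        path.append((node.question, answer))  # keep an audit trail per node
        node = node.yes if answer else node.no
    return node.label, path


def stub_decide(question: str, note: str) -> bool:
    """Placeholder for a model call: answers yes if the keyword is in the note."""
    return question in note


# Hypothetical two-level tree with placeholder questions.
tree = Node(
    question="criterion_a",
    yes=Node(
        question="criterion_b",
        yes=Node(label="meets criteria"),
        no=Node(label="does not meet criteria"),
    ),
    no=Node(label="does not meet criteria"),
)

label, path = classify(tree, "note documents criterion_a and criterion_b", stub_decide)
print(label, path)
```

Recording the path alongside the label keeps each node-level decision available for review, which is the kind of trace one would need in order to characterize reasoning errors at individual nodes.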